In this data analysis project we explore the dataset from the House Prices: Advanced Regression Techniques competition held on Kaggle. The dataset describes residential homes in Ames, Iowa. The goal of is project (and the Kaggle competition) is to predict the sale price of a home based on it’s other attributes. Theses attributes are represented in the dataset by an extensive 79 variables, including everything from the number of bathrooms, to the material the roof is made out of, to the year the home was built.
Our primary goal is to build a model which can predict house prices as accurately as possible. We want our model to work well with previously unseen examples. So to avoid overfitting, we will use cross validation accuracy to evaluate our candidate models.
As a secondary goal, we’d like our model to be stable and interperatable so we will also consider BIC, and Adjusted \(R^2\) when selecting a best model.
First, we load the training data and extract the number of samples and the feature names:
train_data = read.csv('data/train.csv')
n = nrow(train_data)
features = setdiff(colnames(train_data), c("Id", "SalePrice"))
The housing dataset includes 79 features and 1460 samples.
Naturally, we expect that some examples are missing features, let’s try removing those samples and see how much data we’d have left to work with:
train_data_omit_na = na.omit(train_data)
nrow(train_data_omit_na)
## [1] 0
Bummer! It appears every sample has is missing at least one feature. So we can’t simply omit them or we’ll have nothing left to work with! For factor variables with missing values, we can simply create a new other category to fill in for NA values.
other_category = 'other'
for (f in features) {
if (is.factor(train_data[[f]]) && any(is.na(train_data[[f]]))) {
levels(train_data[[f]]) = c(levels(train_data[[f]]), other_category)
train_data[[f]][is.na(train_data[[f]])] = other_category
}
}
# Try omitting samples with missing features again:
train_data_omit_na = na.omit(train_data)
nrow(train_data_omit_na)
## [1] 1121
This appears to solve most of the problem, but we’re still losing 339 samples, since they have missing numeric feature values. Since they’re a large portion (23% of the dataset), we don’t want to ignore them. Unfortunately, finding suitable replacement values for numeric features is more difficult.
Let’s see which features have missing values:
Filter(function (f) any(is.na(train_data[[f]])), features)
## [1] "LotFrontage" "MasVnrArea" "GarageYrBlt"
Great, we only need to deal with missing values for 3 numeric features. Let’s replace the NA values of these features with sensible defaults.
For MasVnrArea, the masonry veneer area in square feet, we have a large number of examples with a value of 0, so that should be a decent replacement. Since it doesn’t seem especially important, we might also consider dropping the feature entirely instead of the samples.
train_data$MasVnrArea[is.na(train_data$MasVnrArea)] = 0
Since GarageYrBlt and YearBuilt are highly correlated (0.8235195) and they’re identical 60% of the time, we’ll simply replace missing GarageYrBlt values with YearBuilt values.
train_data$GarageYrBlt[is.na(train_data$GarageYrBlt)] = train_data$YearBuilt[is.na(train_data$GarageYrBlt)]
Coming up with replacement values for LotFrontage (linear feet of street connected to property) is much more difficult. It seems too important of a feature to simply remove. There are also no cases where the value is 0, and so no obvious replacement value.
There are a few variables like LotArea, LotShape and LotConfig that seem like they might be predictive of LotFrontage. Let’s quickly build a model to predict LotFrontage and then use it to fill in our missing values.
lot_frontage_model = lm(LotFrontage ~ log(LotArea) + LotArea:LotConfig + LotShape + LandSlope + YearBuilt + BldgType, data = train_data_omit_na)
We can be confident the lot_frontage_model above is doing a good job predicting LotFrontage because:
If we wanted to be even more confident, we might check the cross validated RMSE, but because we’re only using this model to fill in missing values, we won’t spend too much time on it.
Now we can fillin the missing LotFrontage values:
train_data$LotFrontage[is.na(train_data$LotFrontage)] = predict(lot_frontage_model, newdata = train_data[is.na(train_data$LotFrontage),])
# Check if there are any remaining NA values in the training data:
any(is.na(train_data))
## [1] FALSE
Success! We’ve managed to replace all of the missing values without losing any training samples.
When we loaded the dataset, R decided which predictors were numeric and which were factors. Some predictors that R considers numeric, might work better as factor variables. For example, the price difference between a 0 and a 2 car garage is likely different than the price difference between a 2 and a 4 car garage.
for (f in features) {
# We'll add any numeric predictor with less than 12 unique values as a factor variable
if (!is.factor(train_data[[f]]) && length(unique(train_data[[f]])) <= 12) {
train_data[[paste(f, 'Fctr', sep = "")]] = as.factor(train_data[[f]])
}
}
# Update `features`:
features = setdiff(colnames(train_data), c("Id", "SalePrice"))
Great, we’ve added 14 new factor variables for a total of 93 features.
// TODO: explain this
category_cutoff = n * 0.05
for (f in features) {
if (is.factor(train_data[[f]])) {
for (category in levels(train_data[[f]])) {
samples_in_cat = sum(na.omit(train_data[[f]]) == category)
if (samples_in_cat < category_cutoff) {
levels(train_data[[f]]) = c(levels(train_data[[f]]), other_category)
train_data[[f]][train_data[[f]] == category] = other_category
}
}
}
}
feature_cutoff = n * 0.95
for (f in features) {
if (is.factor(train_data[[f]])) {
if (length(levels(train_data[[f]])) == 1) {
train_data = train_data[,!(names(train_data) == f)]
} else {
for (category in levels(train_data[[f]])) {
samples_in_cat = sum(na.omit(train_data[[f]]) == category)
if (samples_in_cat > feature_cutoff) {
train_data = train_data[,!(names(train_data) == f)]
}
}
}
}
}
Next we quiclky inspect the correlations between the numeric predictors in our dataset to determine which might be useful in predicting SalePrice and which might have collinearity issues:
numeric_variables <- train_data[which(sapply(train_data, is.numeric))]
numeric_variables <- subset(numeric_variables, select = -c(Id))
correlations <- cor(numeric_variables)
corrplot(correlations, method="square", order ="FPC")
Next we visualize the numeric predictors that are most correlated with SalePrice:
correlated_indices = abs(correlations['SalePrice',]) > 0.2
correlated_predictors = names(correlations['SalePrice',correlated_indices])
par(mfrow = c(ceiling(length(correlated_predictors) / 5.0), 5), mar=c(1,1,1,1), oma = c(0, 0, 3, 0))
for (cor_pred in correlated_predictors) {
plot(train_data[[cor_pred]], train_data$SalePrice,
main=cor_pred, col=rgb(0.1,0.1,0.1,0.3), pch=16,
xaxt='n', yaxt='n', xlab = NA, ylab = NA)
}
mtext("Predictors' Relationships to SalePrice", outer=TRUE, cex = 2)
As an evaluation metric, the Kaggle competition uses Root-Mean-Squared-Error (RMSE) on the logorithm of the predicted and actual sale prices. Taking the logorithm ensures that expensive houses don’t have more of an impact on RMSE than cheap houses.
We’ll use this same metric as our primary quality indicator. To ensure we don’t overfit, we’ll apply it on a 5-fold cross validation set. We’ll also keep our eye on BIC, Adjusted \(R^2\) and a few other metrics.
get_bp_decision = function(model, alpha = 0.05) {
decide = unname(bptest(model)$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_shapiro_decision = function(model, alpha = 0.05) {
decide = unname(shapiro.test(resid(model))$p.value < alpha)
ifelse(decide, "Reject", "Fail to Reject")
}
get_log_rmse = function(observed, predicted) {
sqrt(mean((log(predicted) - log(observed)) ^ 2))
}
get_cv_log_rmse = function(model_formula, folds = 5) {
total_rmse = 0
for (i in 1:folds) {
idx = sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.8, 0.2))
training_set = train_data[idx,]
cv_set = train_data[!idx,]
model = cached_train(model_formula, data=training_set, method="glm")
cv_predicted = predict(model, newdata = cv_set)
# For stability, ignore all predictions are < 1.0
nan_indices = is.nan(log(cv_predicted))
total_rmse = total_rmse + get_log_rmse(cv_set$SalePrice[!nan_indices], cv_predicted[!nan_indices])
}
total_rmse / folds
}
evaluate_model = function(name, model) {
set.seed(42)
data.frame(Model = name,
CV.Log.RMSE = get_cv_log_rmse(model$terms),
Adj.R2 = rsq(model, adj=TRUE),
BIC = BIC(model),
Coefficients = length(coef(model)),
Shapiro.Decision = get_shapiro_decision(model),
BP.Decision = get_bp_decision(model)
)
}
We can see that our response variable SalePrice has a righ skew. A log transform can be used to adjust and normalize the distribution.
par(mfrow=c(1,2))
hist(train_data$SalePrice,main = "SalesPrice")
hist(log(train_data$SalePrice),main = "log(SalesPrice)")
# Start with a simple, naive model using all predictors added together
additive_large = lm(SalePrice ~ ., data = train_data)
additive_large_eval = evaluate_model("Additive Large", additive_large)
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/1dc7dcfe2294b58c7db2d8267acde55d53154dc1.rds"
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/e48f7dab9a7fa924a851ed73636655b162e8129d.rds"
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/acbaf62b1e90944ba172fd50696957ad44848541.rds"
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/c77232ccfd0247067ea5946281d8b361174baa27.rds"
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/c77232ccfd0247067ea5946281d8b361174baa27.rds"
kable(additive_large_eval)
| Model | CV.Log.RMSE | Adj.R2 | BIC | Coefficients | Shapiro.Decision | BP.Decision |
|---|---|---|---|---|---|---|
| Additive Large | 3.909763 | 0.880485 | 35015.8 | 172 | Reject | Reject |
# Variables were selected based on the results of the Data Exploration section above
additive_small = lm(SalePrice ~
LotArea + X1stFlrSF + X2ndFlrSF + MasVnrArea + BedroomAbvGr + OverallQual +
OverallCond + ExterQual + KitchenQual + BsmtQual + BsmtExposure + ScreenPorch +
Fireplaces + MSZoning + YearBuilt, data = train_data)
additive_small_eval = evaluate_model("Additive Small", additive_small)
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/1dc7dcfe2294b58c7db2d8267acde55d53154dc1.rds"
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/e48f7dab9a7fa924a851ed73636655b162e8129d.rds"
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/acbaf62b1e90944ba172fd50696957ad44848541.rds"
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/c77232ccfd0247067ea5946281d8b361174baa27.rds"
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/c77232ccfd0247067ea5946281d8b361174baa27.rds"
kable(additive_small_eval)
| Model | CV.Log.RMSE | Adj.R2 | BIC | Coefficients | Shapiro.Decision | BP.Decision |
|---|---|---|---|---|---|---|
| Additive Small | 3.909763 | 0.8266707 | 34694.53 | 25 | Reject | Reject |
We can do much better on our target metric (CV.Log.RMSE) with the smaller model. Notice that the small additive model has far fewer parameters than the large one (25 vs 172) It seems that with so many predictors, it’s very easy to overfit to the training data.
The small baseline model also has a better BIC. So it looks like BIC might be a good metric to use to avoid overfitting. Next, we’ll use BIC to do a backwards parameter search starting from additive_large model.
additive_bic_backward = cached_step(additive_large, k = log(n), direction = 'backward', trace = 0)
## [1] "[cache] loaded step model from ~/git/STAT420/final-project/cache/923e6ff8a018aa605fe3e56572395ef5434e8a71.rds"
additive_bic_backward_eval = evaluate_model("Additive Backward BIC", additive_bic_backward)
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/1dc7dcfe2294b58c7db2d8267acde55d53154dc1.rds"
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/e48f7dab9a7fa924a851ed73636655b162e8129d.rds"
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/acbaf62b1e90944ba172fd50696957ad44848541.rds"
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/c77232ccfd0247067ea5946281d8b361174baa27.rds"
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/c77232ccfd0247067ea5946281d8b361174baa27.rds"
kable(additive_bic_backward_eval)
| Model | CV.Log.RMSE | Adj.R2 | BIC | Coefficients | Shapiro.Decision | BP.Decision |
|---|---|---|---|---|---|---|
| Additive Backward BIC | 3.909763 | 0.8720961 | 34394.81 | 48 | Reject | Reject |
Doing a parameter search using backwards BIC managed improve our results, but it still fails the Shapiro-Wilks and Breusch-Pagan tests. This suggests we might be able to achieve better results by applying some transformations to our the response and/or predictors.
Based on the graphs in the Data Exploration section above, it appears that many predictors have an exponential or polynomial relationship with SalePrice.
numeric_variables <- names(which(sapply(train_data, is.numeric)))
categorical_variables <- setdiff(names(train_data),(numeric_variables))
# remove Id
numeric_variables <- numeric_variables[numeric_variables!= "Id" & numeric_variables!= "SalePrice"]
numeric_variables_data <- subset(train_data[numeric_variables])
par(mfrow=c(2,2))
tmp<-sapply(numeric_variables,function(var){
hist(train_data[[var]],main = var)
hist(log(train_data[[var]]),main = paste0("log(",var,")"))
})
## Warning in log(train_data[[var]]): NaNs produced
should_log_transform<-c('MSSubClass','LotArea','BsmtFinSF1','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','X1stFlrSF','X2ndFlrSF','LowQualFinSF','GrLivArea','BedroomAbvGr','TotRmsAbvGrd','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','X3SsnPorch','ScreenPorch','PoolArea','MiscVal','MoSold')
new_train_data<-train_data
# apply log transforms to predictors
tmp<-sapply(should_log_transform,function(var){
new_train_data[[var]] <<- log(1+ new_train_data[[var]] )
})
all_factor_column_names<-build_column_factor_map(d = new_train_data)
model_additive = lm(log(SalePrice) ~ . -Id-MSSubClass, data = new_train_data)
pvals<-coef(summary(model_additive))[,"Pr(>|t|)"]
# get pvals less than .001 and make a new formula
pval_cols<-data.frame(var=names(pvals[pvals<0.001]))
pval_cols_used<-unique(merge(all_factor_column_names,pval_cols,by.x = "var",by.y = "var")$val)
pval_formula<-as.formula(paste("log(SalePrice)~", paste(pval_cols_used, collapse="+")))
influential_indexes <- as.vector(which(cooks.distance(model_additive) > (4 / length(cooks.distance(model_additive)))))
pval_model<-lm(pval_formula, data=new_train_data[-influential_indexes])
pval_model_eval<-evaluate_model("pval",pval_model)
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/1dc7dcfe2294b58c7db2d8267acde55d53154dc1.rds"
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/e48f7dab9a7fa924a851ed73636655b162e8129d.rds"
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/acbaf62b1e90944ba172fd50696957ad44848541.rds"
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/c77232ccfd0247067ea5946281d8b361174baa27.rds"
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## [1] "[cache] loaded train model from ~/git/STAT420/final-project/cache/c77232ccfd0247067ea5946281d8b361174baa27.rds"
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
p <- predict(pval_model, newdata = new_train_data[-influential_indexes])
## Warning in predict.lm(pval_model, newdata = new_train_data[-
## influential_indexes]): prediction from a rank-deficient fit may be
## misleading
plot(exp(p),
new_train_data$SalePrice,
col = "dodgerblue",
pch = 20,
main = "Prediction vs Actual",
xlab = "Prediction",
ylab = "Actual"
)
# Results
model_results = rbind(additive_large_eval, additive_small_eval, additive_bic_backward_eval,pval_model_eval)
kable(model_results)
| Model | CV.Log.RMSE | Adj.R2 | BIC | Coefficients | Shapiro.Decision | BP.Decision |
|---|---|---|---|---|---|---|
| Additive Large | 3.909763 | 0.8804850 | 35015.803 | 172 | Reject | Reject |
| Additive Small | 3.909763 | 0.8266707 | 34694.528 | 25 | Reject | Reject |
| Additive Backward BIC | 3.909763 | 0.8720961 | 34394.805 | 48 | Reject | Reject |
| pval | 3.909763 | 0.8472417 | -1102.703 | 28 | Reject | Reject |
best_model = additive_bic_backward # TODO: replace with model that's actually the best
# Fitted vs. Residuals Plot
plot(fitted(best_model), resid(best_model), col = 'grey',
xlab = "Fitted", ylab = "Residuals", main = "Fitted vs. Residuals")
abline(h = 0, col = 'orange', lwd = 2)
# Q-Q Plot
qqnorm(resid(best_model), main = "Normal Q-Q Plot", col = 'grey')
qqline(resid(best_model), col = 'dodgerblue', lwd = 2)
We would like to thank Professor David Unger for his input has been invaluable during the course and lecture hours.
Also Professor David Dalpiaz for the material covered in the course.